Transcribing Multilingual Broadcast News Using Hypothesis Driven Lexical Adaptation
Abstract
This paper describes the first results of our DARPA-sponsored efforts toward recognizing and browsing foreign-language broadcast news, specifically Serbo-Croatian. For Serbo-Croatian, as for many languages other than the most common, well-studied ones, broadcast-quality recognition is complicated by (1) the lack of available acoustic and language data and (2) the excessive vocabulary growth of heavily inflected languages, which leads to unacceptable out-of-vocabulary (OOV) rates. We present a Serbo-Croatian large-vocabulary system that achieves a 74% recognition rate despite limited training data. The system achieves this rate with a multipass strategy that dynamically adapts the recognition dictionary to the speech segment being recognized by generating morphological variations (Hypothesis Driven Lexical Adaptation, HDLA). We outline the bootstrapping and training process of the broadcast news recognition engine, which is based on the Janus Recognition Toolkit (JanusRTk): data collection, segmentation and labeling of the data according to different acoustic conditions, dictionary design, language modeling, and training. The HDLA approach has been tested on both Serbo-Croatian and German news data and achieved considerable recognition improvements. OOV rates were reduced by 35-45%; on the Serbo-Croatian broadcast news data, the OOV rate fell from 8.7% to 4.8%, which also decreased the word error rate from 29.5% to 26%.
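To make the adaptation idea and the reported figures concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a first recognition pass has already produced hypothesis words, and it stands in for real morphological analysis with a crude "shared first four characters" stem match. The function names, the `stem_len` parameter, and the toy word lists are all hypothetical.

```python
def oov_rate(reference_words, vocabulary):
    """Fraction of reference tokens that the vocabulary does not cover."""
    if not reference_words:
        return 0.0
    missing = sum(1 for w in reference_words if w not in vocabulary)
    return missing / len(reference_words)


def adapt_vocabulary(base_vocab, hypothesis_words, background_words, stem_len=4):
    """Add background words sharing a crude stem with a first-pass hypothesis word.

    Assumption: the "stem" is simply the first `stem_len` characters. This is a
    placeholder for genuine morphological variation, used to keep the sketch short.
    """
    stems = {w[:stem_len] for w in hypothesis_words if len(w) >= stem_len}
    variants = {
        w for w in background_words
        if len(w) >= stem_len and w[:stem_len] in stems
    }
    return set(base_vocab) | variants


def relative_reduction(before, after):
    """Relative reduction of a rate, e.g. 0.25 for a 25% relative drop."""
    return (before - after) / before


# Toy data (hypothetical): the first pass outputs in-vocabulary inflections
# in place of the spoken forms, which are missing from the base dictionary.
base_vocab = {"vlada", "je", "danas", "odluka"}
background_words = {"vlade", "vladi", "odluke", "odluku", "sutra"}
reference = ["vlade", "je", "danas", "odluku"]
first_pass_hypothesis = ["vlada", "je", "danas", "odluka"]

oov_before = oov_rate(reference, base_vocab)
adapted_vocab = adapt_vocabulary(base_vocab, first_pass_hypothesis, background_words)
oov_after = oov_rate(reference, adapted_vocab)

# Relative reductions implied by the figures reported in the abstract.
oov_reduction = relative_reduction(8.7, 4.8)    # OOV rate: 8.7% to 4.8%
wer_reduction = relative_reduction(29.5, 26.0)  # word error rate: 29.5% to 26%
```

On the reported Serbo-Croatian figures, the drop from 8.7% to 4.8% is a relative OOV reduction of about 45%, the upper end of the stated 35-45% range, and the drop in word error rate from 29.5% to 26% is a relative reduction of about 12%. In a real system the adapted dictionary would feed a second recognition pass; the sketch stops at measuring vocabulary coverage.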
Similar resources
Phonetic-distance-based hypothesis driven lexical adaptation for transcribing multilingual broadcast news
High out-of-vocabulary (OOV) rates are one of the most prevalent problems for languages with rapid vocabulary growth due to a large number of inflections. Especially when transcribing Serbo-Croatian and German broadcast news, the OOV-rate is between 8.7% and 4.5%. Hypothesis Driven Lexical Adaptation (HDLA) has already been shown to decrease high OOV-rates significantly by using morphology-ba...
Selection criteria for hypothesis driven lexical adaptation
Adapting the vocabulary of a speech recognizer to the utterance to be recognized has proven to be successful both in reducing high out-of-vocabulary as well as word error rates. This applies especially to languages that have a rapid vocabulary growth due to a large number of inflections and composita. This paper presents various adaptation methods within the Hypothesis Driven Lexical Adaptation...
Adaptive vocabularies for transcribing multilingual broadcast news
One of the most prevailing problems of large-vocabulary speech recognition systems is the large number of out-of-vocabulary words. This is especially the case for automatically transcribing broadcast news in languages other than English, that have a large number of inflections and compound words. We introduce a set of techniques to decrease the number of out-of-vocabulary words during recogniti...
Improvements in RWTH LVCSR evaluation systems for Polish, Portuguese, English, Urdu, and Arabic
In this work, Portuguese, Polish, English, Urdu, and Arabic automatic speech recognition evaluation systems developed by the RWTH Aachen University are presented. Our LVCSR systems focus on various domains like broadcast news, spontaneous speech, and podcasts. All these systems but Urdu are used for Euronews and Skynews evaluations as part of the EUBridge project. Our previously developed LVCSR...
The ICSI+ multilingual sentence segmentation system
The ICSI+ multilingual sentence segmentation with results for English and Mandarin broadcast news automatic speech recognizer transcriptions represents a joint effort involving ICSI, SRI, and UT Dallas. Our approach is based on using hidden event language models for exploiting lexical information, and maximum entropy and boosting classifiers for exploiting lexical, as well as prosodic, speaker ...
Publication date: 1998